E-Commerce Analytics: Insights for Growth & Retention¶

1. Introduction¶

Imagine you are managing an online gift shop based in the UK. Every day, hundreds of customers make purchases, and over the course of a year, there are more than 500,000 transactions. Naturally, as a business manager, you would want to know:

  • Which products bring in the most revenue?
  • Which customers are at risk of leaving if we don’t engage them?
  • How can we grow sales while keeping customers loyal?

Dataset Information¶

To answer these questions, I analyzed the Online Retail dataset from the UCI Machine Learning Repository, which contains roughly 542,000 transaction records from December 2010 to December 2011. The dataset mainly records purchases of unique, all-occasion gifts, often by wholesale customers.

In this project, I act as a data detective, exploring transactions, segmenting customers, predicting future high-value buyers, and uncovering products that are frequently bought together. The goal is to turn raw data into actionable insights that can guide marketing, sales, and customer retention strategies.

2. Data Loading & Cleaning¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Before analyzing, it’s important to make sure the data is clean. Missing customer IDs, negative quantities, or incorrect prices could distort insights.

Load dataset¶

In [2]:
df = pd.read_excel("Online Retail.xlsx")

Clean data¶

In [3]:
# Remove missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]

Feature engineering¶

In [4]:
# Create TotalPrice
df["TotalPrice"] = df["Quantity"]*df["UnitPrice"]
# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["InvoiceMonth"] = df["InvoiceDate"].dt.to_period("M").astype(str)

In [5]:
# Get latest transaction date
snapshot_date = df["InvoiceDate"].max()

print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 10)

After cleaning, the dataset contains roughly 398,000 valid transactions across 10 columns, ready for analysis.
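A quick way to verify the cleaning did what we intended is to assert that no missing IDs or non-positive values remain. A minimal sketch on a tiny illustrative frame (the raw Excel file itself isn't reproduced here, so the rows below are made up):

```python
import pandas as pd

# Tiny illustrative frame mimicking the raw data's problem rows
raw = pd.DataFrame({
    "CustomerID": [17850.0, None, 17850.0],
    "Quantity":   [6, 2, -1],        # -1 would be a cancellation/return
    "UnitPrice":  [2.55, 3.39, 2.55],
})

# Same two cleaning steps as above
clean = raw.dropna(subset=["CustomerID"])
clean = clean[(clean["Quantity"] > 0) & (clean["UnitPrice"] > 0)]

# After cleaning: no missing IDs, no non-positive quantities or prices
assert clean["CustomerID"].notna().all()
assert (clean["Quantity"] > 0).all() and (clean["UnitPrice"] > 0).all()
print(clean.shape)  # (1, 3)
```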

In [6]:
df.head()
Out[6]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice InvoiceMonth
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30 2010-12
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 2010-12
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00 2010-12
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 2010-12
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 2010-12

3. Exploratory Data Analysis¶

Top Products¶

First, I asked: Which products generate the most revenue?

In [7]:
top_products = df.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(top_products[::-1], x=top_products.values[::-1], y=top_products.index[::-1],
                      orientation='h', title="Top 10 Products by Revenue").show()

We notice that a few items, such as Paper Craft Little Birdie and Regency Cakestand 3 Tier, contribute disproportionately to total revenue. These “star products” are critical to the business.

Strategy: A small number of products dominate revenue, so focusing marketing efforts on these items, through promotions, bundles, or premium packages, can significantly boost revenue.

Country-Level Revenue¶

Next, I asked: Where do most sales come from?

In [8]:
top_countries = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(top_countries[::-1], x=top_countries.values[::-1], y=top_countries.index[::-1],
       orientation='h', title="Top 10 Countries by Revenue").show()

The UK dominates revenue, while other countries contribute less.

Strategy: The UK is the core market, so marketing and logistics should prioritize it. International markets, though smaller, offer opportunities for testing new products or localized campaigns.

Monthly Revenue Trend¶

We examine seasonality by looking at revenue trends throughout the year:

In [9]:
monthly_sales = df.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
        title="Monthly Sales Trend (Full Year)", markers=True).show()

Looking at sales across months, we notice clear ups and downs. Sales reached a peak in August, likely boosted by summer shopping trends or special promotions. However, November showed a noticeable dip, which could be the result of stock shortages or simply weaker customer demand.

Strategy: Prepare promotions and inventory ahead of busy months, and consider targeted campaigns or discounts during slower months to maintain steady revenue.

4. Customer Segmentation¶

Not all customers return after a purchase. Some high-value buyers who shopped frequently may have stopped, which can hurt revenue.

Using RFM analysis, we capture customer behavior:

  • Recency: How recently a customer purchased
  • Frequency: How often they purchase
  • Monetary: How much they spend
In [10]:
# RFM calculation
rfm = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})

# Scaling
from sklearn.preprocessing import StandardScaler

rfm_scaled = StandardScaler().fit_transform(rfm)

By applying K-Means clustering, customers were automatically grouped based on their buying patterns.

Elbow method¶

In [11]:
from sklearn.cluster import KMeans

inertia = [KMeans(n_clusters=k, random_state=42).fit(rfm_scaled).inertia_ for k in range(1,8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()

The Elbow Method suggests that 3 clusters is optimal.
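As a cross-check on the elbow plot, the silhouette score can corroborate the choice of k, since it peaks at well-separated clusterings. A minimal sketch on synthetic three-cluster data (a real check would pass `rfm_scaled` instead of the `make_blobs` stand-in):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Silhouette score for each candidate k (range -1..1, higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Silhouette scores:", {k: round(v, 3) for k, v in scores.items()})
print("Best k:", best_k)
```

When the silhouette peak agrees with the elbow, as it should here at k=3, the cluster count is on firmer ground.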

K-Means clustering¶

In [12]:
# Train K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)

# Segment Profile
print("RFM Segment Profile:\n", rfm.groupby("Segment").mean().round(2))
RFM Segment Profile:
          Recency  Frequency   Monetary
Segment                               
0          39.98       4.85    2012.11
1         245.02       1.58     631.14
2           6.14      80.21  122888.41

Cluster Profiles:

  • VIPs (Segment 2): loyal, frequent, high-spending customers and the key revenue drivers
  • Mid-tier buyers (Segment 0): moderate spenders with potential to become VIPs
  • At-risk customers (Segment 1): infrequent buyers who may churn without engagement

Highlight at-risk customers¶

“At-risk” customers are those who haven’t purchased in a long time—the top 25% in Recency. Visualizing them in a Recency vs Monetary scatter plot helps identify who might churn:

In [13]:
rfm["AtRisk"] = rfm["Recency"] > rfm["Recency"].quantile(0.75)
rfm["AtRisk_Label"] = rfm["AtRisk"].map({True:"At-Risk", False:"Active"})

fig = px.scatter(rfm, x="Recency", y="Monetary", color="AtRisk_Label", size="Frequency",
                 hover_data=["Recency","Frequency","Monetary"], title="Recency vs Monetary")
fig.show()

Think of them as friends who haven’t visited in a while. Without engagement, they may never return, representing potential lost revenue.

Customer Dynamics:

  • Top-right: At-Risk High Spenders — once loyal, now inactive. Losing them would hit revenue hard
  • Top-left: Active High Spenders — VIPs who shop frequently and sustain the business
  • Bottom-right: At-Risk Low Spenders — low-value but numerous; their collective churn matters
  • Bottom-left: Active Low Spenders — modest spenders, still engaged and can grow with upselling or cross-selling

Strategy:

  • Reward VIPs with exclusive perks or early access
  • Encourage mid-tier buyers to spend more through personalized offers
  • Reactivate at-risk customers via emails, discounts, or reminders

5. High-Value Customer Prediction¶

The next question: Which customers are likely to spend the most in the future?

High-value customers are the top 25% of spenders. Using patterns from past transactions—recency, frequency, average spend, and country—we trained a Random Forest Classifier to predict future high-value buyers.

In [14]:
# Define target
y = (rfm["Monetary"] > rfm["Monetary"].quantile(0.75)).astype(int)

# Features
X = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda x: (snapshot_date - x.max()).days,
    "InvoiceNo": "nunique",
    "UnitPrice": "mean",
    "Quantity": "mean",
    "Country": lambda x: x.mode()[0]})
X = pd.get_dummies(X, columns=["Country"], drop_first=True)

# Scaling
X_scaled = StandardScaler().fit_transform(X)

Train/Test Split¶

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
                                                    random_state=42, stratify=y)

Random Forest Classifier¶

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)

# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int)  # threshold 0.4 to capture more high-value customers

Model Evaluation¶

In [17]:
from sklearn.metrics import classification_report, roc_auc_score

print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       651
           1       0.81      0.75      0.78       217

    accuracy                           0.89       868
   macro avg       0.87      0.84      0.85       868
weighted avg       0.89      0.89      0.89       868

Results:

  • ROC-AUC: 0.943 — excellent ability to distinguish high-value customers
  • Precision: 81% — when the model predicts a high-value customer, it is correct 81% of the time
  • Recall: 75% — 75% of actual high-value customers are correctly identified

The model is reliable. We can now proactively engage customers who are most likely to contribute significant revenue, instead of waiting for them to make purchases on their own.
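The 0.4 cutoff used above deliberately trades some precision for recall. A toy illustration of that trade-off, using hand-made labels and probabilities rather than the model's actual output:

```python
from sklearn.metrics import precision_score, recall_score

# Hand-made labels and predicted probabilities (illustrative only)
y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_proba = [0.90, 0.70, 0.45, 0.42, 0.48, 0.30, 0.20, 0.10]

# Lowering the threshold flips borderline cases to positive predictions
for t in (0.5, 0.4):
    y_hat = [int(p >= t) for p in y_proba]
    print(f"threshold={t}: precision={precision_score(y_true, y_hat):.2f}, "
          f"recall={recall_score(y_true, y_hat):.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.4 raises recall while lowering precision, which is the right direction when missing a high-value customer costs more than a wasted promotion.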

Strategy:

  • Send personalized offers or promotions to likely high-value customers
  • Focus marketing resources on customers with the highest ROI
  • Develop loyalty programs or exclusive bundles to retain top spenders

6. Market Basket Analysis¶

Another way to boost revenue is understanding which products are often bought together. If certain items are frequently purchased as a pair, the business can create bundles or encourage cross-selling.

We analyzed transactions from UK customers using FP-Growth, a method to detect frequently co-purchased items.

In [18]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Create basket: UK transactions pivoted by InvoiceNo and Description
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
          .unstack().fillna(0))
# Convert quantities to boolean flags (item present in invoice or not)
basket = basket > 0

# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
        .sort_values('lift', ascending=False) \
        .head(5)

print("Top 5 Market Basket Rules:\n", rules[['antecedents','consequents','support','confidence','lift']])
Top 5 Market Basket Rules:
                       antecedents                      consequents   support  \
541        (HERB MARKER ROSEMARY)              (HERB MARKER THYME)  0.010153   
540           (HERB MARKER THYME)           (HERB MARKER ROSEMARY)  0.010153   
914    (REGENCY TEA PLATE GREEN )       (REGENCY TEA PLATE ROSES )  0.011534   
915    (REGENCY TEA PLATE ROSES )       (REGENCY TEA PLATE GREEN )  0.011534   
617  (POPPY'S PLAYHOUSE BEDROOM )  (POPPY'S PLAYHOUSE LIVINGROOM )  0.010153   

     confidence       lift  
541    0.933702  86.829038  
540    0.944134  86.829038  
914    0.845815  52.930211  
915    0.721805  52.930211  
617    0.650000  51.769856  

For example, Herb Marker Rosemary and Herb Marker Thyme are almost always bought together, while items in the Regency Tea Plate set are commonly purchased as a pair. Similarly, Poppy’s Playhouse Bedroom and Livingroom items are frequently bought together. These patterns reveal natural associations in customer purchases, which can help the business suggest complementary items, increase the average order value, and enhance the overall shopping experience.
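The lift values above read as "how much more often than chance two items co-occur". A plain-Python sketch of how support, confidence, and lift relate, on a toy set of baskets (the item names are illustrative stand-ins, not rows from the dataset):

```python
# Toy baskets: 5 invoices, each a set of items
baskets = [
    {"rosemary", "thyme"},
    {"rosemary", "thyme"},
    {"rosemary"},
    {"thyme"},
    {"tea plate"},
]
n = len(baskets)

support_a  = sum("rosemary" in b for b in baskets) / n             # P(A)   = 0.6
support_b  = sum("thyme" in b for b in baskets) / n                # P(B)   = 0.6
support_ab = sum({"rosemary", "thyme"} <= b for b in baskets) / n  # P(A,B) = 0.4

confidence = support_ab / support_a              # P(B|A): how often A-buyers also buy B
lift = support_ab / (support_a * support_b)      # >1 means co-occurring more than chance
print(f"support={support_ab:.2f}, confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift of 86 for the herb markers therefore means the pair appears together dozens of times more often than independent purchases would predict.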

Strategy:

  • Display “Frequently Bought Together” recommendations on product pages.
  • Create bundled offers that include associated items.
  • Encourage customers to explore related products, boosting cross-selling opportunities.

7. Conclusion¶

From nearly 398,000 transactions, several important insights emerged. Revenue is concentrated on a few best-selling products and is strongly driven by the UK market. Customer segmentation revealed distinct groups—VIPs, mid-tier buyers, and at-risk customers—allowing for more focused engagement strategies. Predictive modeling made it possible to identify future high-value customers before they spend, enabling proactive retention. Meanwhile, market basket analysis highlighted natural product pairings, opening opportunities for bundling and cross-selling.

Together, these insights provide a roadmap for making data-driven decisions: strengthening loyalty among top customers, reactivating those at risk, and boosting sales through smarter product strategies. The result is not just higher revenue, but also sustainable growth in the competitive e-commerce landscape.